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I. INTRODUCTION 

During the preceding contract year, three tasks have been undertaken. 

They are: 

1) Applications Development System (ADS) Analysis, 

2) Algorithmic Development, 

and 3) Evaluation of Technical Reports. 

Task 1 (see Section II) consisted of a detailed study of the needs of EOD 
with respect to an applications development system (ADS) for the analysis 
of remotely sensed data; followed by an evaluation of four existing systems 
(ERIPS, ASTER, LARSYS batch, and LARSYS 3) with respect to these needs; and 
concluded with a set of recommendations as to possible courses for EOD to 
follow to obtain a viable ADS. Task 2 (see Section III) comprised several 
subtasks of which three were continuations of projects initiated during our 
first year’s contract. These include two algorithms for multivariate density 
estimation, a data smoothing algorithm, a method for optimally estimating 
prior probabilities of unclassified data, further applications of the modified 
Cholesky decomposition in various calculations, and a few other projects. 

Little effort was expended on task 3 (see Section IV) due to a shift in. priori- 
ties mostly necessitated by the increased effort devoted to task 1. However, 
two reports were reviewed. 

This report surmiarizes both the efforts and the findings of the above 
project. Each of the tasks are described in the following sections. 



II. TASK 1: Applications Development System Analysis 

This study (see the Task 1 Final Report) describes the results of a 
detailed study of the needs of EOD for an applications development system 
(ADS), including a detailed evaluation of four existing systems (ERIPS, 
ASTEP, LARSYS batch, and LARSYS 3) with respect to these needs. Suggested 
courses of action are proposed for the EOD to pursue. 

The original task definition in the contract called for: 

1) Developing a set of design goals for an applications 
development system (ADS), 

2) Evaluating ASTEP and the LARSYS batch programs to 
determine whether either met these goals, 

3) Recommending courses of action for the future of these 
systems: 

a) If neither system meets the design goals, 
develop a system design for an ADS that does; 

b) Determine the mutual impact of this system on 
either the IBM 360/75 under RTOS or the UNI VAC 
1110 under EXEC 8; 

c) Develop a recommended approach to the development 
of a data analysis ADS at JSC, 

4) Determining whether a terminal to ASTEP, a LARS terminal, 
or set of batch programs is the most desirable method of 
transporting remote sensing ADP technology to an agency 
establishing a program in remote sensing. 

This task definition was later modified to include ERIPS and LARSYS 3 
in the evaluation study and to exclude item 4 above. 
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It 1s important to remark here that some specific constraints of the 
EOD's were not taken into account in this study. Such considerations not 
involved in any quantitative fashion in this analysis include: 

1) cost of implementing recommended modifications, 

2) system oerformance and response as a function of number 
of users, 

3) availability and canacity of hardware, 
and 4) specific hardware implementations. 

The method employed for conducting this study was to adopt a "top-down" 
approach to the evaluation process. This consisted of: 

1) Developing design goals for an ideal ADS. These goals 
represent general areas of interest that such a system must 
address, and are not of themselves prioritizable. 

2) Detailing supporting design objectives for those design goals. 
These objectives are specfic functional capabilities that 

an ideal ADS should have. 

3) Prioritizing these design objectives with respect to the 
needs of EOD, 

4) Rating the various systems on each of the design objectives 
to indicate how well each satisfied the requirements of each 
objective. 

5) Recommending alternative courses of action for the EOD to 
follow based on the above findings. 
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Before discussinq further the methodology employed and the results of 
this task, it is imnortant to establish a framework for the ideal ADS. 

Such a system will be used to develop and test new algorithms and procedures 
for various remote sensing applications. Its basic characteristics should 
include that it be easy to use for a wide variety of personnel, accessible 
and responsive to users, reliable, and as flexible and complete a system 
as possible. The system must serve two kinds of users - production and 
techniques development personnel. The production user needs to be able 
to efficiently process large. amounts of data using state-of-the-art 
techniques. He requires that results be in form suitable for presentation 
or further analysis. The techniques development person, on the other hand, 
needs a system where he can thoroughly test and evaluate new algorithms and 
techniques. The system thus should be easily modifiable and require the 
user to have only a minimum of knowledge of the internals of the system. 

The user should be able to easily add, delete, replace, or modify any of 
the algorithms in use for his o\*in purposes, while assuring the integrity 
of the standard system. 

The design goals were established to present general areas that a 
TDS should address. These areas are, in summary: 

i) Combination of production and test systems in a unified framework, 

ii) Simplification techniques for system maintenance and enhancement, 

iii) Data and system management facilities, 

iv) Graceful degradation features, 

v) Convenience features, 

vi) System measurement and evaluation features, 

vii) Basic system analysis functions. 
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These qoals as such are not prioritlzable since they do not represent specific 
functional caoabilities. The design objectives, on the other hand, are 
prioritlzable, Thev are specific capabilities for an ideal ADS to have. 

These objectives v/ere then prioritized according to the needs of the EOD 
program objectives. Ratings of from one to four were assigned where the 
ratings are: 


Priority 


Description 


1 

2 

3 

4 


Necessary to achieve EOD program objectives 

Necessary to achieve a high level of EOD's 
program objectives 

Desirable feature 

Questionable desirability 


The various systems were then rated on each of the design objectives. Ratings 
were assigned from zero to five in the basis of how v^fell each system functional ly 
met each objective. The rating codes and their meanings are: 


Rating 

5 

4 

3 

2 

1 

0 


Meaning 

Exceeds requirements of this objective 

Meets all requirements of this objective 

Satisfies most of the requirements of this objective 

Satisfies some of the requirements of this objective 

Satisfies only a small portion of the requirements 
of this objective 

Does not have any such capability as specified 
by this objective. 


The results of this evaluation process comprise the bulk of the Task 1 final 
report. These findings are briefly summarized below. 
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ERIPS possesses several key features, notably an extensive, interactive 
imaging capability and an abundance of user convenience features. Though 
ERIPS is highly modular, it was not designed for modification by the user 
community: most of the coding is in a specially designed assembler language 

and the programming skill necessary to understand the internals of the 
system are far beyond the average user. 

ASTEP , on the other hand, was written mostly in PORTRAN V and the 
coding is relatively easy to decipher. However, no modification aids for 
the user are available and documentation is not very extensive. Though 
ASTEP can be run in an interactive mode, the use of tapes is limited by 
operational difficulties and, thus, system use is limited. Additionally, 
no interactive imaging capabilities exist. 

The LARSYS batch nrograms were also written mostly in FORTRAN V, 
However, very little documentation exists on these programs, thus making 
modification a difficult chore. The most serious problem with these 
programs is that though many functions are available, they are not in a 
unified system, which creates a myriad of problems for users and 
programmers alike. The lack of interactive and interactive imaging 
capabilities further hampers the utility of these programs. 

TARSYS 3 possesses many of the essential features of the ideal ADS. 

It too is written mostly in FORTRAN, and extensive documentation is readily 
available. A variety of modification aids eases somewhat the user's task, 
but other such features do not currently exist. The 'system is relatively 
easy to use, has several modes of operation, and a training program is 
available. An interactive, imaging device exists at Purdue, but none are 
supported elsewhere. It is presently lacking in basic systems analysis 
functions available, but the structure exists for later adding these. 



Thus, in terms of which system cones closest to meeting the requirements 
foranADS. LARSYS 3 appears to be the most suitable in principle. If LARSYS 3 
is to be used most effectively as an ADS, it is worth examining what modifications 
are necessary to further enhance its utility, and how difficult would such 
modifications be to make. This study would indicate that modifying LARSYS 3 
in several specific areas would produce an ADS v/hich would satisfy most of 
the needs of EOD. The major areas of modification include adding more 
analysis functions, adding a more extensive imaging capability, improving 
the modifiability characteristics of the system, probably converting the 
system to run under the IBM Time Sharing Option (TSO), and installing it 
on IBM 370/158 or 168. These latter two modifications are to allow EOD to 
have their own system locally with mainline IBM support and file compatibility 
with other IBM computers (because of using TSO rather than CMS). Compared 
with operating remotely from Purdue, this would eliminate difficult problems 
of supporting remote interactive imaging devices, transferring bulk data 
over long distances, configuration control and future growth of the system, 
and overloading the system at Purdue. This may well represent the best 
courses of action for EOD in terms of capabilities for satisfying their 
needs for an ADS. 

If the above is not possible, one alternative method would be to provide 
an interactive image display tied into the LARSYS 3 system at Purdue. This 
would require intelligent (perhaps specially designed) terminals to effectively 
provide this ability over the long distances involved, and high bandwidth 
communication lines. Other modifications to the system as suggested above 
could be made to LARSYS to increase its utility. However, difficulties may 
be encountered in the areas of overloading the system and transportation of 
data back and forth. Such a configuration would have a substantially lower 
throughput and turnaround capacity, but may be suitable for relatively low 
volume demand. 
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Two other possiblities for an interactive ADS suggest themselves: 
build an entirely new system based on the ideal design goals and objectives 
contained in the Task 1 report, or radically modify the internals of ERIPS. 

Developing a new system based on the established design objectives would 
be a very costly project both in time and money, but it would probably 
provide a very effective means of doing techniques development work. 
Modifications necessary to effectively utilize ERIPS as an ADS consist of 
establishing terminals in Building 17 and providing users with the capability 
to work with the internals of the system. The latter would entail re- 
programming all algorithmic routines into high level language; providing 
interfaces to other system routines which would allow users to perform 
such tasks as menu generation using only the high level language; and 
adding numerous other capabilities to the system. We do not highly 
recommend this approach since it appears that a relatively large amount 
of effort must be expended, and the resulting system would still not be 
entirely satisfactory from the modifiability standpoint. 

Modification of either ASTEP or LARSYS batch is not recommended. The 
basic structures of both of these would not be able to accomodate the 
necessary modifications. However, parts of these systems, particularly 
some of the algorithms, could be used with minor modifications in developing 
a new system or as additional functions in LARSYS 3. 
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III. TASK 2,; Algorithmic Development 
Non-parametri c density estimation : 

Two methods of attacking this problem are being investigated. The aim of 
these projects is to provide a computationany viable way of estimating 
multivariate density functions from relatively small sample sizes, and then 
to devise a classifier using this model rather than the standard Gaussian 
one. Such a classifier could significantly increase classification accuracy 
and also enable one to avoid splitting and later recombining multimodal classes. 

The present efforts are directed towards estimating the densities with 
little concern being given to computational efficiency. It is felt that 
once such algorithms can be effectively used, methods for greatly increasing 
their efficiency will be developed or special purpose hardware could be 
designed. 

a) This algorithm (see the "Estimation of Multivariate Probability Density 
Functions Using B-Splines" by J. 0. Bennett) estimates a p-dimensional 
density function given n random p- vectors of data. The data is first 
transformed to make it pseudo-independent (the covariance matrices are 
transformed into identity matrix ), Then a p-dimensional density kernel 
estimator is used with a n-fold tensor product of B-splines as basis functions. 
The estimator is proven to be consistent in the integrated mean square error 
sense. 

This method developed from an earlier algorithm - spline smoothing of 
histograms. The difficulty v/ith the previous method was that in many 
dimensions, histogranming becomes an arduous task because of the number 
of bins Involved. Thus, though this algorithm functioned quite V'^e 11 in 
one dimension, the results were not readily extendable to the multi- 
dimensional case. The new algorithm avoids the histogramming problem entirely. 
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All of this leads to an algorithm which yields a "good" estimate of a 
multivariate density function even with small sample sizes of training data. 

This algorithm has been implemented and tested using random numbers from a 
variety of distributions. Performance has been quite satisfactory. It 
has also been installed in the version of LARSYS operating on Rice University's 
IBM 370/155 to compare it with Gaussian maximum likelihood classifier. 

Results of these tests show that, though the B-spline estimator is relatively 
slow, its performance on "Gaussian-like" data is comparable to a Gaussian 
estimator; whereas on other distributions (e.g., bimodal), its performance 
is significantly better. 

The algorithm as presently implemented in the classification section 
of LARSYS is slow. This is mainly due to the fact that the estimate of 
the value of the density function of a class for an arbitrary data point 
involves (l) a rotation of the data vector, and (2) the calculation of the 
value of a cubic B-spline for each dimension and each data vector from the 
training samples. For large training sample sizes, the second of these 
features can entail a very large amount of computation. However, a few 
points suggest ways for alleviating this problem. First of all, cubic 
B-splines have finite support and thus need not always be explicitly 
evaluated. Also, schemes for ordering the training data can be used to 
avoid performing many of the computations. In addition, for many applications, 
one can use linear basis functions instead of cubic, thus considerably reducing 
the number of computations necessary. 

At this point, we can suggest guarded optimism for the applicability and 
usefulness of this algorithm in remote sensing applications. More testing with 
ronote sensing data needs to be done to determine how generally successful the 
algorithm is in the environment. Improvements to this algorithm are currently 
being investigated. 
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b) In this study we consider the problem of estimating the probability functions 
t\a,b] which gave rise to the random samples "(x-j,X 2 , ,x^ ^ 

The interval [a,b]; may be either infinite or finite. 


Recall that bv L(v) , the likelihood that v . L^[a,b] 

the s armies ... ,x K we mean 

1 ^ n 


N 

L(v) = . n v(xj 
* 1 • 


gave rise to 


Let S be a manifold in L^[a,b] . By the maximum likelihood estimate 

corresponding to the samples ^ and the manifold S, we mean the 

solution of the following optimization problem: 


maximize L(v); subject to 
V c S, v(t) >0 V t i [a»b] ^nd 



dt = 


1 


It is well-known that the Parametric likelihood estimate (S is finite dimensional ) 
is well defined. However, a finite dimensional manifold does not approximate 
well. Hence it makes sense to consider nonparametric maximum likelihood 
estimation (infinite dimensional S). Clearly, if the manifold S can approximate 

f 

the Dirac delta function, i.e., contains nonneqative functions whose support 
is a given small interval centered at x e [a,b] , integrate to 1 and have 

arbitrarily large values at x, then our optimization problem has no solution. 
Moreover ,• this approximation property is enjoyed by most infinite dimensional ' 
manifolds in L^[a,b] i hence, we should not expect the nonparametric 

maximium likelihood estimation problem to have a solution. The situation in 
present-day applications is actually worse, for it Is often the case that in 
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the parametric case we choose S from a seauence of manifolds j* 

QD 

where the dimension of \ is m, e S^^-| and is 

dense in L^[a,b] ; hence the problem is definitely unstable and 

somewhat ill -defined. Namely we are motivated to choose m large so that 
we can better approximate the probability density giving rise to the 
samples; hov/ever, for a large m the problem approximates a problem which 
has no_ solution. 

The previous remarks motivated the maximum penalized likelihood 
estimate; which consists of replacing the functional L in our optimization 

A 

problem with the functional L defined by 

A N 2 

L(v) = n v(x.) exp (-1 lv| I ) 

i = l 

where the norm 1 1* I ! is some appropriate norm on the manifold S. We 
consider many interesting maximum penalized likelihood estimators and 
show that they are well defined. We also show that some maximum penalized 
likelihood estimators are splines and give some numerical examples. 

(See the technical report "Nonparametric Maximum Likelihood Estimation 
of Probability Densities by Penalty Function Methods" by G. F. de Montricher, 
R. A. Tapia, and J. R. Thompson), 

Use of Soatial Information : 

Using interpolation polynomials of odd degree, a method of detecting and 
correcting errors in equally spaced data is presented. This method permits 
one point to be corrected without contaminating good points. To each point, 
the method associates an error that measures the distance between this point 
and the polynomial that interpolates certain neighboring points. By 
selecting the points with the largest error and moving it so that the error 
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decreases* a smoother set of data is produced. The method is said to be 
local because if just one ooint is bad, it is going to be detected and 
corrected without disturbing its neighbors. By successively selecting 
the points with the greatest error and modifying them, smoother data is 
produced. To measure smoothness and to give another interpretation to 
the error of point, we use the fact that the distance between a point and 
the polynomial of degree 2k- 1 that interpolates its 2k neighbors is the 
2k- th divided difference. Therefore, by smoothness, we understand the 
summation of the squares of the k-th divided differences of each point, 
and by moving the point with the greatest 2k-th divided difference, the 
smoothness will decrease the most. It is proved that ultimately the method 
converges, therefore the possibility, of moving one point back and forth 
is excluded. 

By preprocessing some of the data, a threshold error can be found 
beyond which data is said to be smooth. This error is given in terms of 
the ratio between the average of all the point errors of the original 
data and maximum point error in each iteration. Another interesting 
feature of this method is that it simulates the fairing or smoothing of 
data points as performed by a human. In such cases, the method will smooth 
the data until the error associated with each point is impossible to detect 
by human sight. The value of this minimum error is furnished by modern 
psychology. 

This method, when applied to C-1 flight line data, improved the 
performance of classification in each of the different classes considered 
by approximately 5 percentage points. Here each channel and line of pixels 
were independently smoothed, but the method is readily extendable to smoothing 
in both physical dimensions. 
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It was then compared with spline smoothing. In addition to being 
considerably faster, our method yields more consistent improvement in 
accuracy whereas spline smoothing sometimes oversmoothes the data (see the 
report "Error Detection and Data Smoothing Based on Local Procedures" by 
V. M. Guerra). 

A new approach to the problem of estimating proportions : 

The problem of estimating acreages or proportions of the several crons 
under consideration has received attention recently in the research literature. 
In some acreage estimation problems, it is realistic to assume that the 
signatures of the several classes have well-known statistics, while their 
proportions are unknown. In estimating the acreage of wheat, for example, 
we can model the problem as a two class case, where in class 1 we have wheat, 
and in class 2 everything else. The basic difficulty in estimating acreages 
lies in the fact that the estimate must be based on unclassified noisy data. 

If the data are classified with zero error probability, the problem is 
trivial, and a simple "relative frequency" estimate is intuitively and 
theoretically satisfying. In order to have Bayes classification rule, 
knowledge of the prior probabilities (or proportions) is necessary. On the 
other hand, in order to estimate the prior probabilities we need to classify 
a sufficient amount of data. Hence, in order to have a decent performance 
in classification and estimation of priors, it is profitable to look at the 
coupled problem of Bayes classification and simultaneous estimation of 
prior probabilities. 

A report has been published on this subject, with the title "Optimal 
Design with Unknown Priors" by D. Kazakos. In this report, a sequential 
scheme for simultaneous classification of data and updating the estimated 
prior probabilities is proposed and analyzed. The probability density 
functions under each of the M classes are assumed known, and the prior 
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probabilities are assumed unknown and sequentially estimated. It is proved 
that the scheme converges to the true value of the prior probabilities, 
and hence the adaptive classification scheme converges to the Bayes classifier. 
Furthermore, a significant property of the scheme is that the error variance 
of the estimate of the prior probabilities converges to zero as , 

where N = number of observations. This is significant, because even if we 
had a set of N perfectly classified observations * the error variance of the 
relative frequency estimate would converge to zero as N"^ . The recursive 

form of the estimation scheme makes it attractive for situations where the 
proportions are varying. The method can "track*' slowly varying proportion 
vectors. 

Other variations of the proposed method are currently under investigation. 
In a forthcoming report, theoretical and numerical comnarisons of several 
related proportion estimation methods will be presented. 

Numerical Optimization of Algorithms : 

This task is concerned with developing numerically optimal algorithms for 
use in remote sensing analysis. It is our view that the algorithms employed 
in remote sensing applications be as accurate as possible since use of 
unreliable algorithms can lead to inaccurate results, and, possibly, very 
erroneous interpretations. Such difficulties might be very hard to detect 
when they occur and could cause considerable delays. Also, the algorithms 
used should be as efficient as possible to conserve computer time and thus 
the resources of the project. In the past, we have shown how the modified 
Cholesfcy decomposition (MCD) may be employed in many computations where the 
covariance matrices and their inverses are used. This has effected a 
considerable savings in computation time and increased accuracy over algorithms 
previously in use. Memorandum on the following applications of the MCD are 
included in this report: 
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1) Computing the average weighted divergence 

2) Computing the interclass divergence (for use in calculating the 
transformed divergence) 

3) Performing feature selection using D. Tebbe's criterion. 

Feature Selection : 

Another minor project in this task is to devise an algorithm for performing 
feature selection (or extraction) using the condition numbers of the covariance 
matrices as a measure of the separability of the classes. The condition number 
of a matrix is defined by 

cond (A) = 1 I A1 ) ! 1 A’*'* I 1 

where larger condition numbers represent matrices whose rows (columns) are 
more nearly linearly dependent. Then the rationale for applying this measure 
to the feature selection problem is to find the subspaces containing most 
of the information and thus to have the rows (columns) of the covariance 
matrices be as orthogonal as possible. 

A program was written to select subsets of channels from Cl data and 
to pick a subset of a specified size that minimized the maximum condition 
number of the covariance matrices. Classification was then run using 
subsets selected by the average divergence criterion (a d c). Performance 
decreased relative to using channels selected by the a d c. As a next step, 
we hope to examine how the condition numbers vary as the a d c selects 
subsets, hopefully to gain insight into what are suitable criteria to employ 
when using condition numbers as a distance measure. 
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IV. TASK 3: Review of Technical Reports 

G. Austin's memos "Analysis of LARS Subroutine CLASS and Recommended Coding 
Improvements to Reduce Its Execution Time" and "Modifications to ERIPS 
Requirements" have been reviewed. A change in priorities prevented other 
reports from being reviewed. 
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RICE UNIVERSITY 

Institute tor Computer Services 
and Applications 

MEMORANDUM 

DATE: Aug. 15, 1973 

TO: K. Baker 

FROM: D. L. Van Rooy 

Use of the Modified Cholesky Decomposition in Interclass Divergence 
Calculation. 

To compute the transformed divergence, the interclass divergence 
D ( i , j ) is needed. An efficient and numerically stable method for computing 
the D ( i , j ) is to employ the modified Cholesky decomposition of the covariance 
matrices. This note will derive the appropriate expressions for doing this. 



I. Derivation 


The interclass divergence D (i, j) is given by 


with 


and 


D(i. j) 

= ( i , j 

i) + D2<i,j) 

(U) 


= i tr j 

JKi-lCj) (Kjl-Kj-l) ] 

( 12 ) 


= 1 tr 1 

[(Kjl + lC-b (U--V..) (Ui-Uj)*] 

(1.3) 

K. is the 

i 

covariance matrix ; p. , the corresponding mean 



victor, and tr denotes the trace. Dj^ can now be simplified to 


Dj = i tr (K.kT’) + i tr (K.Kr^) - n 


-1 


(1-4) 


where n is the order of the K^'s. Now since the K^’s are symmetric, 
positive- definite, we may write (modified Cholesky decomposition). 


/f- 


y 



1.2 


K. = L . G. L. 


where is unit lower triangular and is diagonal. So we can write 

= tr[L:*LiG,L* L*-'g:'] 

= ,,.5) 


where 


= L»j and also is unit lower triangular. So (1.5) and 

a similar expression may be used to calculate , using eq. (1.4) 
Now for ,we may rewrite (1. 3) as 


= 


= i [ (Ui - u.)* (kI^ + k‘^) (uj - Uj) ] 


( 1 . 6 ) 


Again, then we may use 


and define 


K. = L. G. L. 
1 111 


m.. = u- - u. 
13 1 n 


So (1.6) may be rewritten as 


-rsjt ^~1“1-1 -1 

'2 = i [m.. (Lj G. Lj + L. G. L. ) ] 


*-l -1 “1 


if* 5^ 4=“* 1“1“1 “ 1 “ 1 “I 

= i [m.. L. G. L. m.. + m.. L. G. L. m.. J 


* _^"1-1 _“1 


i Fx^G’^x. + xf G.^ X. 1 

L 1 1 1 3 3 J J 


(1.7) 



where 


and 



, -I 

X. = L. m.. 
J J iJ 
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II. Computatji Ti of the D( i , j ) 

To calculate the interclass divergence D( i , j ) , first calculate 

the modified Cholesky decomposition of the two covariance matrices K. 

and K. ^ 

J 

K. = L.G.L.* 

K. = L. G. Lf 
3 J J j 

where L. and L. are unit lower triangular matrices and G. and G. are 

^ J i j 

diagonal matrices. Using the notation K. = { } , L. = } 

and G. = -[ we can write 


i = 1 
ss 


s = 1, 2, ... n 


( 2 . 1 ) 


= 


g = 


11 


s “ 1 


rs 


4 ( 4 )' 

P=1 

(’'rs - 1 4^i>^sp)/gsJ 

p = l 

r = s+l, s + 2, ... n 


I s = l, 2, ... n 


with = 0 for s > r. Similar expressions hold for the elements of 

L and G. . Next we need the elements of T.. and T.. { T and 

J J ji 11 ** 

{t’’ I 

^ rs J 




1 


-1 I (2.2) 

r, r-l r, r-1 r, r-1 

, r-1 , .. } r = l, 2, ... n 

ji i j r» j J‘ 

= - I 


rs 


rs rs lu rp ps 
p=s+l ^ 


s = r“2, r- 3, ... 1 


ii ij 

and = 0 for s > r. Similar expressions hold for t 


rs 


Now form 


m.. = Vi. - u. 

ij 1 } 


( 2 , 3 ) 


and 


r- 1 


X = 


IJ Kp A i 

‘"r ~ I 

p=l 

r- 1 


J iJ 


X' = m 


- y i X 

r L rp p 

p= 1 


(2. 4) 


r = 1, 2, ... n 


where } , x = { } 


Now D ( i , j ) is given by 


n r r 


.. 2 


D(i,j) = i Z t [(4) 4/sr + (^rJ S^Arl 
r = l I j6=1 


+ 


(\ )V 4 + (4)V 


Sr ^ " 


(2. 5) 


So the steps to form D( i , j ) are 


1. Form L. , , and using eqs, (2. 1) 



Form 

Form 

Form 


Tj. and T.j using eqs. (2.2) 
m. . using eq. (2. 3) 


X. and X. 
1 J 


using eqs. 


(2.4) 


Calculate D(i,j) from eq. (2.5) 



RICE UNIVERSITY 

Institute for Computer Services 
and Applications 


MEMORANDUM 



DATE; September 21, 1973 
TO: K, Baker ( 1 A 

FROM: D. L. Van Rooy 
RE: Use of the Cholesky decomposition in D. Tebbe’s feature selection 



Analysis 

Tebbe's method of feature selection consists of using a without replace- 
ment procedure for picking features, and classifying training fields to 
"determine” the probability of correct classification. At first, this may 
seem to be a quite time-consuming method. However, two points serve 
to expedite the procedure, 

(1) the without replacement procedure greatly 
reduces the number of feature combinations 
to be used 

and 


(2) by partitioning the covariance matrices and 
judiciously saving appropriate results, the 
amount of computation may be* greatly reduced 
both in computing the inverses of these matrices 
and classifying the training elements. 

In his example, computation time for this method has been comparable 
to the exhaustive search, without- replacement divergence calculations, 
while classification accuracies have been greater than or equal to those 
using this diver'^ence computation results. 

The puxpose of this note is to show how the execution time of Tebbe's 
method may be significantly decreased by employing the modified Cholesky 
decomposition^^^ . We have 

K = LDL* 

where K is the covariance matrix, L is unit lower triangular (i.e. 1. = x, 
j&y =■• 0 for j > i ), D is a diagonal matrix and * denotes - ^ 
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transp>ose. The logarithm of the density function of a class (within a 
constant) is 


f (x) = I ^ 


K 


+ 4(x - ti)* (x-u) 


( 1 ) 


where u is the corresponding mean vector. So there are two problems 
to be attacked: 

(i) how to economically obtain and ' 
given 

subscript denotes the order of the matrix 


and 


^3 = 



i. e. is a submatrix of (because 

of the without replacement procedure), 

V is a vector, and o a scalar. 


and 

(ii) given L and D how does one economically 
evaluate eq. (1). 

Now it is easy to show that 

i' 1 


^(n) = -Y H U. 

m \ m ^ ij J nj// i 

i = i 


ri=i. 2, . 


jj(n) _ ^(n-1) 

ij ij 

,(n)\ 


j = 1. 2, ... i 


J 


(2a) 
.. n-1 
(2b) 


where | and similaxdy for K and D. 


also 


d(n) 

n 


= - 


nn 


/ \2 , _ 

y ^(^-1) 

^ np p 
p = l 




(3a) 
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and 


,(n) .("‘1) 

d = d 


i = 1, 2, . . . n ' 1 


(3b) 


Thus L differs from L ^ only in the last row, and D. differs from 
n n-i n 

|_r. 

j only in the nn^^ position. 

We note that 

I 

K. 


n 


L D L* 
n n n 




n 


D 


L* 

n 

1 


n 


D 


K. 


n 


= d 


n 

(n) 

n 


D 


n-1 


( 4 ) 


Rewriting eq. (1) we have 


f(x) “ 


K- 1+ i ^ 6. + i y 

n-1 n ' ^ n n 


( 5 ) 


where y = x - jj 


Defining z = L y 



( 2 \ 

with z' ^ a scalar 


we obtain for the last term in (5) 
2* D„^ z 


n 




2 (n) 





So (5) becomes 
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f(x) = 4 (to 



+ 4(«4"> + (zW)Yd;,">) (6) 


with 


z. 

1 


i - 1 



j=i 


,(n-l) 

Sf** 

IJ 


z. 

J 


z 


n 


n 


n - 1 

- Z 


j = l 




i = 1, 2, ... n-1 (7a) 


(7b) 


So we wee that the first two terms in eq. (6) do not depend on any values 

th 

from the n channel. Thus they may be precomputed and used with each 
of the channels. 


Comparing these results with Tebbe's, we note that the classification 
of each point will require the same amount of computation after the leading 
term (his ^ ) has been computed. However, the proposed method 

O 


is faster and more numerically stable since 


1) 

2 ) 


no matrix inverses need be computed, and 

the calculation of f in Tebbe's method requires 
2 

n multiplications whereas the corresponding 
calculation of requires ^ n^ multiplications. 
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Implementation 


The above results can be readily utilized in the existing program. The 
following describes the changes necessar}^(see ref. (1) ) : 

^ -I 

J.) Tebbe’s Y is not computed. Instead, the modified 

o 

Cholesky decomposition ( ~ L DL ) is computed (see 

ref (2)) (it could be saved from the prior case v/here the 

best n-1 channels were computed). This yields the 

and d ^ used in eqs. (2a) & (2b). 
ij ) 


(n) 


2) In place of the calculations of his e find f will be ^ . 

(n) 

i = l,2, ... n-1 and from eqs. (2a) & (3a) in 

(n-1) (n-1) 

this report. (Note that the product in 

eq.- 2a ma 5^ be precomputed). 

(n) 


3) His 071 - e becomes d 


n 


4) His S is now , 

o n-1 


n-1 


wliere 


n-1 


22 ,("■!) , (D* -1 (1^ 

= nr d. and D' ^ - 

i=i ^ 


n- 1 


7 1 
n-1 


I -Uh 


i= 1 


and z. is given by eq. (7a) 


5) The term ( f ' - 6 )^ in S in his report becomes 


( 2 ) 

where z^ * is given by eq. (7b). 





Keferences 


Dennis L. Tebbe, ’’Feature Selection Software for Pattern Recognition 
applied to Atultispectral Earth Resources l>ata^ ” NASA - EOD internal 
maino, August 10, 1973. 


D. L. Van Kooy, MAS. Lynn, and C. H. Snyder, - ’^The Use of the 
Modified Cholesky Decomposition in Divergence and Classification 
Calculations, " IGSA Technical Report #275-025-008, Rice 
University, Houston, Texas, May, 1973. 



DATE: 

TO: 

FROM: 

RE: 


RICE UNIVERSn 


Institute for Computer Services 
and Applications 

MEMOBANDUM 

November 30, 1.973 
Ken B 

D. L. 

Improved method for computing the ave]*age weighted divergence 



The average weighted divergence may be written, following Quirein^^\ 
as : 


m 

m-1 

m 

Z tr{Kp S.) 

- " y 

y w,. 


L ij 

i-T 

i = l 



(I) 


where K. is the covariance matrix 

m is the number of classes 
n is the dimensionality 


and 

S. = 
1 



( 2 ) 


with W . . 
ij 


being the weighting factors 


and 6 . ■ 
ij 


being the difference of the means between ’ ' 
classes i and j ^ 
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Now we note that both the K. and S. are syininetric, poG in v'^e- definite 
matrices. Thus we may write 



K. 

1 

11 

r 

o 

M- 

(modified Cl loIesUy decomposition) 

(3a) 

and 

S. 

1 

^ K. + Rf 

3. 1 

(3b) 

where 

L. 

1 

and R. 
1 

are lower triangular and G. a dir.gonal 


matrix. 

So the term 

IC^ ^ S. in eq. (I) becomes 




-1 

1C. 

1 

s. = Lpi g:^ l:’ (r. +Rp 



tr(K- S.) = tr(G-l R. L^l + R: ) 

= tr (Q. + Q«) 

= 2tr(Q.) 


where 


Qi = Gj* l:’ Rj Lf-1 


So now eq. (1) becomes 


m - I m 

D = J tr(Q.) - n I ^ W.. 

i 1 i = 1 I = i -h 1 


We note tliat onl}^ the diagonal elements of Q. are needed, 
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where 



First form 


( 6 ) 



which requires 


-1 

Li Ri where Tj is upper triangular 

IL(£±l)_(.'V^2) lies 

6 


Tlien compute the diagonal elements of L j^ T. = C. . This also 
n (n+1) (n+2) 


1 1 


multiplies. So eq. (6) becomes 


6 
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Thus the total number of multiplies involved is 


2 n ( 11+ 1 ) { n+2 ) 


/ 2n' N 
which is O ^ cj ) , 


+ n 


including the multiplications necessary to perform the modified Cholesky 
decomposition. 


Algorithm 


1). Compute the S^‘s according to eq. (2) using all c nannels 
Also compute the last term in eq. (5) . 


2), Form the R.'s using eq. (3b) . 


i.e. r^, = 

jk jk 




= isg ^ 

Jj 3 ] 


r = 0 


3 > k 


3) . For particular conbinations of channels .pick out the s , bmatrices 

K* and 

4) . Form L. and Gj as in eq. (3a) . 
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i , 1 

g = k.. 

• n 

J JJ 


i /i ^ 

Vj vVj 


. / i 5 

Z < (v) 

u-i 


i 1 1 

) g 

^ u vu 1 


U = 1 


V = j+1 , j+2 , ... 11 


I j = 1 , 2 , . . . n 



1 

g. 

J 


i i 

with = 1 and , = 0 for i > v 
J3 vj 


5). Compute the upper triangular elements of using 


t^. = rl. 
33 j3 


j = 1 , 2 , ... n 


.1 1 

*•1, = 1^1, ■ 
3k kj 


k - 1 




u = 1 


ku uj 


> k = 1, 2, ... n-1 


j = kH'l , k + 2 , . . , n 


6). Compute the following elements of C. 


1 1 

c = t 
jk jk 


j - 1 

ju uk 


U = 1 


k = 


3 = 1 , 2, ... k 


1,2,.. n 



N. B. only the diagonal elements of are needed but the ot iiers 

are necessaiy for the cal collation of these elements. J 


7). Form 


D = 


m 

n 

/ 

m - 

1 

m 



/ i 

- n I 


V 

I 

1 hi 

/h ■ 


L 

i= 1 

j=i 

i = 

1 

j-iH 


ij 


the average weiglited divergence. 
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